Group Members: Yingying Qian, Xiao Xiao, Qingnan Wang, Hanchen Liu, Runfeng Zhang
Environment: Google Colab
Link: https://colab.research.google.com/drive/1gfdWe4DIwi8EOkcUI5ET_vemQLWRofCF
Suppose we are part of a data science team at Warner Brothers (WB) and are asked to solve the following tasks.
WB has released several movies recently. However, there are not yet enough reviews on movie review websites to tell how audiences have reacted to them. Many reviews have been posted on social media (Twitter, Facebook, etc.), and we have already scraped them. We want the DS team to build models that predict audience sentiment.
Topic models can help us analyze the movies our audience is discussing. Topic modeling surfaces the hot topics around our movies, which are useful for marketing strategy, and analyzing these topics helps us make better decisions about future movie themes. We can also classify reviews by topic, so users can easily find reviews on the topics they are interested in.
Note: This notebook may not scale to huge datasets (e.g., 50 GB). For future needs, the code should be adapted to a large-scale data processing framework such as Spark.
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re #regular expressions
import matplotlib.pyplot as plt #basic visuals
import seaborn as sns #more advanced visuals
from sklearn.feature_extraction.text import CountVectorizer #count vectorizer
from sklearn.model_selection import train_test_split #train test split
from sklearn.pipeline import Pipeline #import the pipeline
from sklearn.feature_extraction.text import TfidfVectorizer #import Tf idf Vectorizer
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.metrics import classification_report, plot_confusion_matrix #classification report
df = pd.read_csv('/content/drive/MyDrive/DSO 560 NLP project/IMDB Dataset.csv')
df.head()
|   | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |
# check for missing values
df.isna().sum()
review       0
sentiment    0
dtype: int64
df['sentiment'].value_counts()
plt.figure(figsize = (8,4), dpi = 100)
sns.countplot(data = df, x = 'sentiment');
# 1. Negative reviews
count_vect = CountVectorizer(stop_words = 'english')
matrix = count_vect.fit_transform(df[df.sentiment == 'negative']['review'])
freqs = zip(count_vect.get_feature_names_out(), matrix.sum(axis = 0).tolist()[0])
#print sorted words
print(sorted(freqs, key = lambda x: -x[1])[:20])
[('br', 103997), ('movie', 50117), ('film', 37595), ('like', 22458), ('just', 21075), ('good', 14728), ('bad', 14726), ('time', 12358), ('really', 12355), ('don', 10622), ('story', 10185), ('people', 9469), ('make', 9355), ('movies', 8313), ('plot', 8214), ('acting', 8087), ('way', 7780), ('characters', 7353), ('watch', 7220), ('think', 7129)]
# 2. Positive reviews
count_vect = CountVectorizer(stop_words = 'english')
matrix = count_vect.fit_transform(df[df.sentiment == 'positive']['review'])
freqs = zip(count_vect.get_feature_names_out(), matrix.sum(axis = 0).tolist()[0])
#print sorted words
print(sorted(freqs, key = lambda x: -x[1])[:20])
[('br', 97954), ('film', 42110), ('movie', 37854), ('like', 17714), ('good', 15025), ('just', 14109), ('great', 12964), ('story', 12934), ('time', 12752), ('really', 10739), ('people', 8719), ('love', 8692), ('best', 8510), ('life', 8137), ('way', 7865), ('films', 7601), ('think', 7208), ('characters', 7103), ('don', 7001), ('movies', 6996)]
# change the label from string to 0s and 1s
label={'positive':1, 'negative':0}
df['positive']=df['sentiment'].map(label)
df.head()
|   | review | sentiment | positive |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |
# HTML Tags
df['review'][2]
'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'
# define clean function
clean_regex = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
def clean(old_text):
text = re.sub(clean_regex, '', old_text)
return text
# apply the function to the entire column
df['review'] = df['review'].apply(clean)
df['review'][2]
'I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.This was the most I\'d laughed at one of Woody\'s comedies in years (dare I say a decade?). While I\'ve never been impressed with Scarlet Johanson, in this she managed to tone down her "sexy" image and jumped right into a average, but spirited young woman.This may not be the crown jewel of his career, but it was wittier than "Devil Wears Prada" and more interesting than "Superman" a great comedy to go see with friends.'
# Remove punctuation except ? and !, which represent sentiment
df.review = df.review.str.replace(r"[^\w\s\!\?]", " ", regex=True)
# check punctuation
df['review'][9]
'If you like original gut wrenching laughter you will like this movie If you are young or old then you will love this movie hell even my mom liked it Great Camp!!!'
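A quick check of that pattern with `re` directly (toy string assumed): everything except word characters, whitespace, `!`, and `?` becomes a space.

```python
import re

pattern = r"[^\w\s\!\?]"  # same pattern as above
text = 'Great, isn\'t it "fun"? Yes!'
print(re.sub(pattern, " ", text))
```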
import nltk
nltk.download('punkt') # A popular NLTK sentence tokenizer
nltk.download('stopwords') # library of common English stopwords
from nltk.corpus import stopwords
from nltk import word_tokenize, sent_tokenize
stp=set(stopwords.words('english'))
stp.add('movie')
stp.add('film')
reviews = []
for i in df.review:
new_words=[]
for w in word_tokenize(i):
if w not in stp:
new_words.append(w)
reviews.append(' '.join(new_words))
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
df.review = reviews
df['review'][0]
'One reviewers mentioned watching 1 Oz episode hooked They right exactly happened The first thing struck Oz brutality unflinching scenes violence set right word GO Trust show faint hearted timid This show pulls punches regards drugs sex violence Its hardcore classic use word It called OZ nickname given Oswald Maximum Security State Penitentary It focuses mainly Emerald City experimental section prison cells glass fronts face inwards privacy high agenda Em City home many Aryans Muslims gangstas Latinos Christians Italians Irish scuffles death stares dodgy dealings shady agreements never far away I would say main appeal show due fact goes shows dare Forget pretty pictures painted mainstream audiences forget charm forget romance OZ mess around The first episode I ever saw struck nasty surreal I say I ready I watched I developed taste Oz got accustomed high levels graphic violence Not violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skills prison experience Watching Oz may become comfortable uncomfortable viewing thats get touch darker side'
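Note that "The", "They", and "This" survive in the output above: NLTK's stopword list is lowercase, so capitalized tokens slip through the membership test. A minimal illustration with a hand-made lowercase stopword set (assumed, not NLTK's full list):

```python
# Small lowercase stopword set, mimicking NLTK's stopwords.words('english')
stp = {"the", "they", "was", "what", "this"}
sentence = "The movie was what They expected"

naive = [w for w in sentence.split() if w not in stp]          # misses "The"/"They"
fixed = [w for w in sentence.split() if w.lower() not in stp]  # catches them

print(naive)
print(fixed)
```

Lowercasing the tokens before the membership test (or lowercasing the whole review first) would make the filter consistent.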
# import nltk
# from nltk.stem import WordNetLemmatizer
# # nltk.download('wordnet')
# from nltk.corpus import wordnet
# # lemmatization
# lemmatizer = WordNetLemmatizer()
# # source
# # https://gaurav5430.medium.com/using-nltk-for-lemmatizing-sentences-c1bfff963258
# def lemmatize_sentence(sentence):
# #tokenize the sentence and find the POS tag for each token
# nltk_tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# #tuple of (token, wordnet_tag)
# wordnet_tagged = map(lambda x: (x[0], nltk_tag_to_wordnet_tag(x[1])), nltk_tagged)
# lemmatized_sentence = []
# for word, tag in wordnet_tagged:
# if tag is None:
# #if there is no available tag, append the token as is
# lemmatized_sentence.append(word)
# else:
# #else use the tag to lemmatize the token
# lemmatized_sentence.append(lemmatizer.lemmatize(word, tag))
# return lemmatized_sentence
# # function to convert nltk tag to wordnet tag
# def nltk_tag_to_wordnet_tag(nltk_tag):
# if nltk_tag.startswith('J'):
# return wordnet.ADJ
# elif nltk_tag.startswith('V'):
# return wordnet.VERB
# elif nltk_tag.startswith('N'):
# return wordnet.NOUN
# elif nltk_tag.startswith('R'):
# return wordnet.ADV
# else:
# return None
# reviews_lemma = [' '.join(lemmatize_sentence(i)) for i in df.review]
# df['cleaned_lemmatized_reviews'] = reviews_lemma
# df.head()
## perform stemming
from nltk.stem.porter import PorterStemmer
# stemming
stemmer = PorterStemmer()
stem = []
for sent in df.review:
sent_stem = []
for w in word_tokenize(sent):
w_stem = stemmer.stem(w)
sent_stem.append(w_stem)
stem.append( ' '.join(sent_stem))
df['cleaned_stem_reviews'] = stem
df.head()
|   | review | sentiment | positive | cleaned_stem_reviews |
|---|---|---|---|---|
| 0 | One reviewers mentioned watching 1 Oz episode ... | positive | 1 | one review mention watch 1 Oz episod hook they... |
| 1 | A wonderful little production The filming tech... | positive | 1 | A wonder littl product the film techniqu unass... |
| 2 | I thought wonderful way spend time hot summer ... | positive | 1 | I thought wonder way spend time hot summer wee... |
| 3 | Basically family little boy Jake thinks zombie... | negative | 0 | basic famili littl boy jake think zombi closet... |
| 4 | Petter Mattei Love Time Money visually stunnin... | positive | 1 | petter mattei love time money visual stun watc... |
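PorterStemmer collapses inflected forms to a shared stem, which is why "wonderful" appears as "wonder" in the table above; a quick illustration:

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
words = ["watching", "watched", "watches"]
# All three collapse to the same stem
print([stemmer.stem(w) for w in words])
```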
model_results = {}
# count vectorizer
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer(max_df=0.3,
max_features=500)
X = count_vectorizer.fit_transform(df.cleaned_stem_reviews)
y = df["positive"].values
df_cv = pd.DataFrame(X.toarray(), columns=count_vectorizer.get_feature_names_out())
df_cv["positive"] = y
df_cv
| 10 | abl | absolut | act | action | actor | actress | actual | add | after | age | all | almost | along | alreadi | also | although | alway | amaz | american | and | anim | annoy | anoth | anyon | anyth | anyway | appar | appear | around | art | as | ask | at | attempt | audienc | aw | away | back | bad | ... | view | viewer | voic | wait | walk | want | war | wast | way | we | went | what | when | whi | while | white | whole | wife | wish | with | without | woman | women | wonder | word | work | world | wors | worst | worth | write | writer | written | wrong | ye | year | yet | you | young | positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 49996 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49997 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49998 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 49999 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
50000 rows × 501 columns
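Here `max_df=0.3` drops tokens that appear in more than 30% of the documents (very common words carry little signal), and `max_features=500` caps the vocabulary at the 500 most frequent remaining tokens. A toy check of `max_df`'s behavior (the 0.5 threshold and corpus are assumptions for the example):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["cat dog", "cat fish", "cat bird", "dog fish"]

# "cat" appears in 3/4 (75%) of documents, above max_df=0.5, so it is dropped
vect = CountVectorizer(max_df=0.5)
vect.fit(docs)
print(sorted(vect.vocabulary_))
```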
# TFIDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1,1),
max_df=0.3,
max_features=500)
X = tfidf_vectorizer.fit_transform(df.cleaned_stem_reviews)
y = df["positive"].values
df_tfidf = pd.DataFrame(X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
df_tfidf["positive"] = y
df_tfidf
| 10 | abl | absolut | act | action | actor | actress | actual | add | after | age | all | almost | along | alreadi | also | although | alway | amaz | american | and | anim | annoy | anoth | anyon | anyth | anyway | appar | appear | around | art | as | ask | at | attempt | audienc | aw | away | back | bad | ... | view | viewer | voic | wait | walk | want | war | wast | way | we | went | what | when | whi | while | white | whole | wife | wish | with | without | woman | women | wonder | word | work | world | wors | worst | worth | write | writer | written | wrong | ye | year | yet | you | young | positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.098508 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.109768 | 0.0 | 0.210427 | 0.000000 | 0.000000 | ... | 0.113116 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.235316 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1 |
| 1 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.114122 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.181458 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.137150 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.152255 | 0.0 | 0.0 | 0.170278 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.140115 | 0.000000 | 1 |
| 2 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.100592 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.343902 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.149274 | 0.0 | 0.132306 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.111396 | 0.0 | 0.000000 | 0.137926 | 1 |
| 3 | 0.143771 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.125288 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.302762 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0 |
| 4 | 0.000000 | 0.0 | 0.0 | 0.072078 | 0.103091 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.092156 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.074077 | 0.230325 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.127726 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.080871 | 0.098983 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 49995 | 0.125172 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.146317 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.270821 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.167244 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.135506 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 1 |
| 49996 | 0.000000 | 0.0 | 0.0 | 0.131825 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.238657 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.559427 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.148816 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0 |
| 49997 | 0.000000 | 0.0 | 0.0 | 0.189713 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.156995 | 0.153462 | 0.000000 | 0.0 | 0.000000 | 0.117436 | 0.100636 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.097486 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.154186 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0 |
| 49998 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.164351 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.177314 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0 |
| 49999 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.132869 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.181499 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.106803 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.154405 | 0.311894 | 0.0 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0 |
50000 rows × 501 columns
def traintest_split(df):
from sklearn.model_selection import train_test_split
    train_df, test_df = train_test_split(df, random_state=100) # set random_state for stable output
X_train = train_df.loc[:, ~train_df.columns.isin(['positive'])]
y_train = train_df["positive"]
X_test = test_df.loc[:, ~test_df.columns.isin(['positive'])]
y_test = test_df["positive"]
return X_train, y_train, X_test, y_test
data = df[['positive','cleaned_stem_reviews']]
X_train, y_train, X_test, y_test = traintest_split(data)
len(X_train)
37500
len(X_test)
12500
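The 37,500/12,500 sizes follow from `train_test_split`'s default `test_size` of 0.25; a quick check on stand-in row indices:

```python
from sklearn.model_selection import train_test_split

rows = list(range(50000))
train, test = train_test_split(rows, random_state=100)  # default test_size=0.25
print(len(train), len(test))  # 37500 12500
```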
def lr_result(vectorized_df):
X_train, y_train, X_test, y_test = traintest_split(vectorized_df)
## logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)
# predict
y_pred = lr.predict(X_test)
## Evaluate LR result
# calculate accuracy
accuracy = np.mean(y_pred == y_test)
# AUROC (area under the receiver operator curve)
from sklearn.metrics import roc_auc_score
roc = roc_auc_score(y_test, y_pred)
return accuracy, roc, y_pred
# LR using count_vectorizer
accuracy, roc, y_pred = lr_result(df_cv)
print('Accuracy: ',accuracy)
print('ROC: ',roc)
model_results['LR(count_vectorizer)'] = accuracy
Accuracy: 0.85224
ROC: 0.8521276642317995
# LR using TF-IDF vectorize
accuracy, roc, y_pred = lr_result(df_tfidf)
print('Accuracy: ',accuracy)
print('ROC: ',roc)
model_results['LR(TF-IDF)'] = accuracy
Accuracy: 0.85216
ROC: 0.8520561458697065
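`classification_report` was imported at the top but not used above; it adds per-class precision, recall, and F1 on top of accuracy. A sketch on toy labels (standing in for `y_test` and `y_pred`, not the model's actual predictions):

```python
from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # toy ground truth
y_hat  = [1, 0, 1, 0, 0, 1, 1, 0]  # toy predictions

# Per-class precision/recall/F1 alongside overall accuracy
print(classification_report(y_true, y_hat, target_names=["negative", "positive"]))
```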
df
|   | review | sentiment | positive | cleaned_stem_reviews |
|---|---|---|---|---|
| 0 | One reviewers mentioned watching 1 Oz episode ... | positive | 1 | one review mention watch 1 Oz episod hook they... |
| 1 | A wonderful little production The filming tech... | positive | 1 | A wonder littl product the film techniqu unass... |
| 2 | I thought wonderful way spend time hot summer ... | positive | 1 | I thought wonder way spend time hot summer wee... |
| 3 | Basically family little boy Jake thinks zombie... | negative | 0 | basic famili littl boy jake think zombi closet... |
| 4 | Petter Mattei Love Time Money visually stunnin... | positive | 1 | petter mattei love time money visual stun watc... |
| ... | ... | ... | ... | ... |
| 49995 | I thought right good job It creative original ... | positive | 1 | I thought right good job It creativ origin fir... |
| 49996 | Bad plot bad dialogue bad acting idiotic direc... | negative | 0 | bad plot bad dialogu bad act idiot direct anno... |
| 49997 | I Catholic taught parochial elementary schools... | negative | 0 | I cathol taught parochi elementari school nun ... |
| 49998 | I going disagree previous comment side Maltin ... | negative | 0 | I go disagre previou comment side maltin one t... |
| 49999 | No one expects Star Trek movies high art fans ... | negative | 0 | No one expect star trek movi high art fan expe... |
50000 rows × 4 columns
import spacy
import spacy.cli
spacy.cli.download("en_core_web_md")
nlp = spacy.load("en_core_web_md",disable=['ner','tagger','parser'])
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')
# preprocessing review
spacy_df = pd.DataFrame(df['review'].str.replace(r'<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});', '', case=False, regex=True))
spacy_df.columns = ['prp_review']
spacy_df.head()
|   | prp_review |
|---|---|
| 0 | One reviewers mentioned watching 1 Oz episode ... |
| 1 | A wonderful little production The filming tech... |
| 2 | I thought wonderful way spend time hot summer ... |
| 3 | Basically family little boy Jake thinks zombie... |
| 4 | Petter Mattei Love Time Money visually stunnin... |
from string import punctuation
spacy_df['prp_lem_review'] = spacy_df['prp_review'].apply(lambda x: ' '.join([token.lemma_ for token in nlp(x)
if not token.is_stop
and token.text not in punctuation
and token.text!='\n']))
spacy_df.head()
|   | prp_review | prp_lem_review |
|---|---|---|
| 0 | One reviewers mentioned watching 1 Oz episode ... | reviewer mention watch 1 Oz episode hook right... |
| 1 | A wonderful little production The filming tech... | wonderful little production film technique una... |
| 2 | I thought wonderful way spend time hot summer ... | think wonderful way spend time hot summer week... |
| 3 | Basically family little boy Jake thinks zombie... | Basically family little boy Jake think zombie ... |
| 4 | Petter Mattei Love Time Money visually stunnin... | Petter Mattei Love Time Money visually stun wa... |
spacy_df['word2vec'] = spacy_df['prp_review'].apply(lambda x: nlp(x).vector)
base_df = pd.concat([pd.DataFrame(spacy_df['word2vec'].values.tolist()), df['positive'].reset_index(drop=True)], axis=1)
base_df.head()
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | ... | 261 | 262 | 263 | 264 | 265 | 266 | 267 | 268 | 269 | 270 | 271 | 272 | 273 | 274 | 275 | 276 | 277 | 278 | 279 | 280 | 281 | 282 | 283 | 284 | 285 | 286 | 287 | 288 | 289 | 290 | 291 | 292 | 293 | 294 | 295 | 296 | 297 | 298 | 299 | positive | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.085625 | 0.113938 | -0.096713 | -0.050682 | 0.067096 | -0.027273 | -0.008549 | 0.017516 | 0.016594 | 1.980726 | -0.134060 | -0.013262 | 0.067707 | -0.044855 | -0.035980 | -0.057104 | -0.008305 | 0.798957 | -0.069566 | -0.098267 | 0.018137 | -0.044904 | 0.097195 | -0.052976 | 0.001954 | -0.108625 | -0.050712 | -0.043276 | -0.037111 | -0.069433 | -0.053573 | 0.061442 | -0.157201 | 0.003342 | 0.050568 | -0.010287 | -0.000302 | 0.030447 | -0.120839 | 0.019652 | ... | 0.095552 | -0.031201 | 0.035830 | 0.044112 | 0.012925 | 0.030884 | 0.047218 | 0.165926 | 0.048705 | -0.074791 | -0.048848 | -0.157731 | -0.102950 | -0.015641 | 0.027510 | 0.026498 | -0.010761 | -0.030305 | 0.206609 | 0.072768 | 0.043551 | 0.020540 | -0.034169 | -0.083799 | 0.039720 | 0.092597 | 0.054014 | 0.012308 | -0.039834 | -0.148019 | 0.010207 | -0.067727 | -0.037153 | 0.015345 | 0.057352 | -0.051846 | -0.060179 | 0.042206 | 0.072123 | 1 |
| 1 | -0.059508 | 0.101667 | -0.066424 | -0.012344 | 0.006142 | 0.040348 | 0.054405 | -0.075953 | -0.028763 | 1.865435 | -0.066720 | 0.050341 | -0.024497 | 0.004854 | 0.007449 | -0.011252 | -0.030919 | 0.946995 | -0.144023 | -0.059326 | 0.029295 | -0.031011 | 0.004169 | -0.046758 | 0.023570 | 0.025369 | 0.102018 | 0.031544 | -0.020643 | -0.100728 | -0.073243 | 0.067338 | -0.065873 | 0.011564 | 0.030689 | -0.055507 | 0.007980 | 0.038705 | -0.013779 | -0.087589 | ... | 0.057629 | 0.046665 | 0.019746 | 0.102003 | -0.008896 | 0.046167 | 0.001690 | 0.253655 | 0.119373 | -0.189553 | -0.025473 | -0.126789 | -0.190016 | -0.107141 | 0.016398 | 0.072218 | -0.066210 | 0.010131 | 0.187657 | 0.130190 | 0.038825 | -0.043276 | -0.120535 | -0.035319 | -0.052783 | 0.057975 | -0.042145 | 0.042719 | -0.115392 | -0.211747 | 0.024240 | -0.046898 | 0.068820 | 0.081703 | 0.052659 | 0.015161 | -0.038128 | -0.007154 | 0.051344 | 1 |
| 2 | -0.031895 | 0.128547 | -0.068776 | -0.092622 | 0.020030 | 0.055127 | 0.053285 | -0.131540 | 0.085579 | 1.988347 | -0.128655 | -0.002442 | -0.008967 | -0.042887 | -0.027007 | -0.025767 | -0.025582 | 0.791268 | -0.105340 | -0.024962 | -0.014827 | -0.106630 | 0.019857 | -0.102065 | -0.068406 | 0.023237 | -0.087062 | -0.064993 | 0.019546 | -0.056289 | -0.128369 | 0.046966 | -0.078863 | 0.021608 | 0.060748 | -0.065500 | 0.023451 | 0.011134 | -0.175108 | -0.068270 | ... | 0.145821 | -0.019991 | 0.045711 | 0.061955 | -0.040956 | 0.002337 | 0.126577 | 0.144331 | 0.145385 | -0.178633 | 0.013118 | -0.120895 | -0.143168 | -0.115829 | 0.067229 | 0.000064 | -0.091062 | -0.031855 | 0.238313 | 0.129045 | 0.048015 | -0.005121 | -0.052089 | -0.045701 | 0.050236 | 0.162557 | -0.000453 | 0.086091 | 0.015191 | -0.129507 | 0.044706 | -0.072763 | -0.004463 | 0.100818 | -0.000717 | -0.023600 | -0.082397 | -0.010813 | 0.046681 | 1 |
| 3 | -0.081507 | 0.027952 | -0.144960 | -0.068188 | -0.003852 | -0.011254 | 0.044061 | -0.186122 | 0.095581 | 1.980411 | -0.206378 | 0.053850 | -0.013778 | 0.048356 | -0.089570 | -0.055436 | 0.021578 | 0.584212 | -0.149846 | 0.046477 | -0.020595 | -0.087711 | 0.061625 | -0.054834 | 0.010359 | 0.056653 | -0.059649 | -0.095513 | 0.023260 | -0.132827 | -0.074311 | 0.007049 | -0.078337 | -0.010283 | 0.125862 | -0.076256 | 0.079149 | 0.134745 | -0.066503 | -0.022556 | ... | 0.143747 | 0.029207 | 0.067187 | 0.080861 | -0.013663 | -0.026975 | 0.056327 | 0.188322 | 0.159378 | -0.099348 | 0.011662 | -0.161579 | -0.092100 | -0.138555 | 0.010826 | -0.049577 | 0.011100 | -0.043279 | 0.194928 | 0.152051 | 0.051987 | -0.041930 | -0.096006 | 0.034518 | 0.050958 | 0.035877 | -0.080547 | -0.029784 | -0.037922 | -0.084655 | -0.003077 | -0.078988 | 0.023808 | 0.109924 | 0.084974 | -0.068735 | -0.067864 | -0.033867 | -0.031053 | 0 |
| 4 | -0.013270 | 0.111341 | -0.141966 | -0.086904 | 0.073514 | 0.015217 | 0.005798 | -0.019719 | 0.056314 | 1.996038 | -0.167431 | -0.050925 | 0.024698 | -0.056136 | -0.017730 | -0.023470 | -0.130222 | 0.883261 | -0.119780 | -0.012888 | -0.018163 | -0.038417 | -0.006539 | -0.059612 | 0.036331 | 0.060618 | -0.004736 | -0.006216 | -0.042526 | 0.039264 | -0.040787 | 0.030492 | -0.074793 | 0.038850 | 0.048998 | -0.027575 | 0.062670 | -0.050928 | -0.123921 | -0.119282 | ... | 0.100339 | -0.008039 | 0.042183 | 0.038958 | -0.147871 | 0.068978 | 0.108874 | 0.259209 | 0.141014 | -0.193415 | 0.012265 | -0.083056 | -0.101665 | -0.098894 | 0.044043 | 0.018726 | 0.027991 | 0.007609 | 0.163973 | 0.068285 | 0.042604 | -0.045394 | -0.030198 | 0.002238 | -0.037670 | 0.129354 | -0.052185 | 0.023485 | -0.019228 | -0.203349 | 0.048139 | -0.072654 | -0.032961 | 0.101331 | 0.024770 | -0.040568 | -0.085062 | -0.048905 | 0.064297 | 1 |
5 rows × 301 columns
from sklearn.model_selection import cross_validate
x_train = base_df.iloc[:,:300]
y_train = base_df['positive']
lr = LogisticRegression()
score = cross_validate(lr, x_train, y_train, scoring=['accuracy','roc_auc'], cv=10, n_jobs=-1,
return_train_score=True, return_estimator=True)
print(np.mean(score['test_accuracy']))
# per-fold results: index = train accuracy, column 0 = test accuracy
print(pd.DataFrame(score['test_accuracy'], score['train_accuracy']), '\n'*2)
print(np.mean(score['test_roc_auc']))
# per-fold ROC AUC in the same layout
print(pd.DataFrame(score['test_roc_auc'], score['train_roc_auc']))
model_results['LR(spacy)'] = np.mean(score['test_accuracy'])
0.8562799999999999
0
0.857933 0.8630
0.858422 0.8626
0.859000 0.8546
0.858689 0.8578
0.859778 0.8476
0.858156 0.8638
0.860578 0.8422
0.859889 0.8524
0.858978 0.8574
0.858222 0.8614
0.931461168
0
0.933289 0.935096
0.933394 0.934185
0.934099 0.927936
0.933370 0.933249
0.934729 0.921500
0.933245 0.936184
0.934191 0.927668
0.933788 0.930676
0.933421 0.934585
0.933559 0.933533
import numpy as np
docs = df.review
labels = df.positive
from scipy.spatial.distance import cosine
spacy.cli.download("en_core_web_lg") # takes a while to run since we use the large model
nlp = spacy.load("en_core_web_lg",disable=['ner','tagger','parser'])
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
# takes a very long time: every review is run through the spaCy pipeline
stopwords_removed_docs = list(
map(lambda doc: " ".join([token.text for token in nlp(doc) if not token.is_stop]), docs))
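Running the full `nlp()` pipeline per document just to drop stop words is the slow part here. Since only the stop-word flag is needed, a plain set-based filter gives an equivalent and much faster result. A minimal sketch, using a tiny hand-picked stop set for illustration (in practice you would use spaCy's `STOP_WORDS`):

```python
# Stop-word filtering without running a full NLP pipeline.
# STOP_WORDS below is a tiny illustrative set, not spaCy's full list.
STOP_WORDS = {"the", "a", "an", "is", "was", "it", "and", "of"}

def remove_stopwords(doc: str) -> str:
    # keep only tokens whose lowercase form is not a stop word
    return " ".join(tok for tok in doc.split() if tok.lower() not in STOP_WORDS)

docs = ["the movie was a masterpiece", "it is an awful waste of time"]
cleaned = [remove_stopwords(d) for d in docs]
# cleaned -> ['movie masterpiece', 'awful waste time']
```

Note that whitespace splitting is only an approximation of spaCy's tokenizer (punctuation handling differs), so results will not match the cell above exactly.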
from keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=5000, oov_token="UNKNOWN_TOKEN")
tokenizer.fit_on_texts(stopwords_removed_docs)
def integer_encode_documents(docs, tokenizer):
return tokenizer.texts_to_sequences(docs)
# integer encode the documents
encoded_docs = integer_encode_documents(stopwords_removed_docs, tokenizer)
seq_lengths=[len(x) for x in encoded_docs]
pd.Series(seq_lengths).describe()
count    50000.000000
mean       100.933100
std         78.391106
min          3.000000
25%         52.000000
50%         74.000000
75%        123.000000
max       1298.000000
dtype: float64
sns.displot(seq_lengths,bins=20)
<seaborn.axisgrid.FacetGrid at 0x7fef0b65c2d0>
for i in [80,90,95,99]:
print(f"{i}th percentile of arr : ", np.percentile(seq_lengths, i))
80th percentile of arr :  141.0
90th percentile of arr :  201.0
95th percentile of arr :  266.0
99th percentile of arr :  403.0
MAX_SEQUENCE_LENGTH = 150 # chosen from the sequence-length distribution: between the 80th and 90th percentiles
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_docs = pad_sequences(encoded_docs, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
labels = to_categorical(encoder.fit_transform(labels))
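`to_categorical` turns the integer-encoded 0/1 labels into one-hot rows so they match the 2-unit softmax output. A NumPy sketch of the same transformation on a hypothetical label vector:

```python
import numpy as np

labels = np.array([1, 0, 1, 1, 0])  # hypothetical binary labels

# One-hot encode: row i is all zeros except a 1 at column labels[i],
# which is what keras' to_categorical produces for integer class labels.
one_hot = np.eye(2)[labels]
# one_hot[0] -> [0., 1.] (class 1), one_hot[1] -> [1., 0.] (class 0)
```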
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size=0.2)
from random import randint
from numpy import array, argmax, asarray, zeros
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Embedding
VOCAB_SIZE = int(len(tokenizer.word_index) * 1.1) # 10% headroom over the full tokenizer vocabulary (embedding input_dim must exceed the largest index)
!wget https://dso-560-nlp-text-analytics.s3.amazonaws.com/glove6b100dtxt.zip
!unzip glove6b100dtxt.zip
--2021-12-18 02:27:45--  https://dso-560-nlp-text-analytics.s3.amazonaws.com/glove6b100dtxt.zip
Resolving dso-560-nlp-text-analytics.s3.amazonaws.com (dso-560-nlp-text-analytics.s3.amazonaws.com)... 52.216.245.44
Connecting to dso-560-nlp-text-analytics.s3.amazonaws.com (dso-560-nlp-text-analytics.s3.amazonaws.com)|52.216.245.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 137847651 (131M) [application/zip]
Saving to: ‘glove6b100dtxt.zip.3’

glove6b100dtxt.zip. 100%[===================>] 131.46M  53.5MB/s    in 2.5s

2021-12-18 02:27:48 (53.5 MB/s) - ‘glove6b100dtxt.zip.3’ saved [137847651/137847651]

Archive:  glove6b100dtxt.zip
replace glove.6B.100d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename:
def load_glove_vectors():
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
for line in f:
values = line.split()
word = values[0]
coefs = asarray(values[1:], dtype='float32')
embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))
return embeddings_index
embeddings_index = load_glove_vectors()
Loaded 400000 word vectors.
# create a weight matrix for words in training docs
embedding_matrix = zeros((VOCAB_SIZE, 100))
for word, i in tokenizer.word_index.items():
embedding_vector = embeddings_index.get(word)
if embedding_vector is not None: # check that it is an actual word that we have embeddings for
embedding_matrix[i] = embedding_vector
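Words missing from GloVe keep an all-zero row in `embedding_matrix`, so it can be worth checking what fraction of the vocabulary is actually covered. A toy sketch of that coverage check (the miniature `word_index` and `embeddings_index` here are hypothetical stand-ins for the real ones):

```python
import numpy as np

# Hypothetical miniatures of tokenizer.word_index and the GloVe dictionary.
word_index = {"movie": 1, "great": 2, "xqzzy": 3}
embeddings_index = {"movie": np.ones(100), "great": np.ones(100)}

# Count vocabulary words that have a pretrained vector.
covered = sum(1 for word in word_index if word in embeddings_index)
coverage = covered / len(word_index)
# 2 of the 3 toy words have a pretrained vector, so coverage is ~0.67
```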
from keras.layers import SimpleRNN, LSTM
from keras.layers import Flatten, Masking
from tensorflow.keras.utils import plot_model  # needed for the plot=True option below
# define model
def make_binary_classification_rnn_model(plot=False):
model = Sequential()
model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    model.add(Masking(mask_value=0.0)) # skip timesteps whose embedding is all zeros (padding positions and words with no pretrained vector)
model.add(SimpleRNN(units=64, input_shape=(1, MAX_SEQUENCE_LENGTH)))
model.add(Dense(16))
model.add(Dense(2, activation='softmax'))
# Compile the model
model.compile(
optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()
if plot:
plot_model(model, to_file='model.png', show_shapes=True)
return model
def make_lstm_classification_model(plot=False):
model = Sequential()
model.add(Embedding(VOCAB_SIZE, 100, weights=[embedding_matrix], input_length=MAX_SEQUENCE_LENGTH, trainable=False))
    model.add(Masking(mask_value=0.0)) # skip timesteps whose embedding is all zeros (padding positions and words with no pretrained vector)
model.add(LSTM(units=32, input_shape=(1, MAX_SEQUENCE_LENGTH)))
model.add(Dense(16))
model.add(Dense(2, activation='softmax'))
# Compile the model
model.compile(
optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()
if plot:
plot_model(model, to_file='model.png', show_shapes=True)
return model
The LSTM layer has $4(dN + d^2 + d)$ trainable parameters, where $d$ is the number of memory cells and $N$ is the number of dimensions for a data point.
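For an LSTM, each of the four gates has its own input weights ($d \times N$), recurrent weights ($d \times d$), and bias ($d$), giving $4(dN + d^2 + d)$ trainable parameters. A quick arithmetic check against the values used in this notebook:

```python
# LSTM trainable parameters: 4 gates, each with input weights (d*N),
# recurrent weights (d*d), and a bias vector (d).
d, N = 32, 100  # units=32, embedding dimension=100, as in the model above
lstm_params = 4 * (d * N + d * d + d)
# lstm_params -> 17024, matching the lstm layer row in the model summary
```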
model = make_lstm_classification_model()
Model: "sequential_1"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_1 (Embedding) (None, 150, 100) 11384500
masking_1 (Masking) (None, 150, 100) 0
lstm_1 (LSTM) (None, 32) 17024
dense_2 (Dense) (None, 16) 528
dense_3 (Dense) (None, 2) 34
=================================================================
Total params: 11,402,086
Trainable params: 17,586
Non-trainable params: 11,384,500
_________________________________________________________________
from keras import callbacks
callback = callbacks.EarlyStopping(monitor='val_loss', patience=2)
# fit the model
history = model.fit(X_train, y_train,validation_split = 0.1, epochs=20, verbose=1, callbacks=[callback])
Epoch 1/20
1125/1125 [==============================] - 128s 112ms/step - loss: 0.4361 - accuracy: 0.7959 - val_loss: 0.3692 - val_accuracy: 0.8340
Epoch 2/20
1125/1125 [==============================] - 126s 112ms/step - loss: 0.3342 - accuracy: 0.8547 - val_loss: 0.3817 - val_accuracy: 0.8365
Epoch 3/20
1125/1125 [==============================] - 119s 106ms/step - loss: 0.3007 - accuracy: 0.8704 - val_loss: 0.3113 - val_accuracy: 0.8627
Epoch 4/20
1125/1125 [==============================] - 118s 105ms/step - loss: 0.2787 - accuracy: 0.8822 - val_loss: 0.3186 - val_accuracy: 0.8680
Epoch 5/20
1125/1125 [==============================] - 119s 105ms/step - loss: 0.2588 - accuracy: 0.8919 - val_loss: 0.3068 - val_accuracy: 0.8720
Epoch 6/20
1125/1125 [==============================] - 118s 105ms/step - loss: 0.2439 - accuracy: 0.8994 - val_loss: 0.2954 - val_accuracy: 0.8780
Epoch 7/20
1125/1125 [==============================] - 118s 105ms/step - loss: 0.2275 - accuracy: 0.9069 - val_loss: 0.2965 - val_accuracy: 0.8750
Epoch 8/20
1125/1125 [==============================] - 118s 105ms/step - loss: 0.2130 - accuracy: 0.9120 - val_loss: 0.3108 - val_accuracy: 0.8727
Epoch 9/20
1125/1125 [==============================] - 125s 111ms/step - loss: 0.1987 - accuracy: 0.9189 - val_loss: 0.3331 - val_accuracy: 0.8705
import keras
from matplotlib import pyplot as plt
def plot_fit_history(history):
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper left')
plt.show()
plot_fit_history(history)
# evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
print('Accuracy: %f' % (accuracy*100))
model_results['LSTM'] = accuracy
313/313 [==============================] - 10s 29ms/step - loss: 0.3360 - accuracy: 0.8739
Accuracy: 87.390000
The SimpleRNN layer has $dN + d^2 + d$ trainable parameters, where $d$ is the number of memory cells and $N$ is the number of dimensions for a data point.
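A SimpleRNN has a single cell rather than four gates, so its parameter count is just $dN + d^2 + d$. Checking with the values used here:

```python
# SimpleRNN trainable parameters: input weights (d*N),
# recurrent weights (d*d), and a bias vector (d).
d, N = 64, 100  # units=64, embedding dimension=100, as in the model above
rnn_params = d * N + d * d + d
# rnn_params -> 10560, matching the simple_rnn row in the model summary
```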
model = make_binary_classification_rnn_model()
Model: "sequential_2"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding_2 (Embedding) (None, 150, 100) 11384500
masking_2 (Masking) (None, 150, 100) 0
simple_rnn (SimpleRNN) (None, 64) 10560
dense_4 (Dense) (None, 16) 1040
dense_5 (Dense) (None, 2) 34
=================================================================
Total params: 11,396,134
Trainable params: 11,634
Non-trainable params: 11,384,500
_________________________________________________________________
from keras import callbacks
callback = keras.callbacks.EarlyStopping(monitor='val_loss', patience=2)
# fit the model
history = model.fit(X_train, y_train,validation_split = 0.1, epochs=20, verbose=1, callbacks=[callback])
Epoch 1/20
1125/1125 [==============================] - 61s 54ms/step - loss: 0.6334 - accuracy: 0.6464 - val_loss: 0.5531 - val_accuracy: 0.7340
Epoch 2/20
1125/1125 [==============================] - 53s 47ms/step - loss: 0.6189 - accuracy: 0.6541 - val_loss: 0.6229 - val_accuracy: 0.6425
Epoch 3/20
1125/1125 [==============================] - 54s 48ms/step - loss: 0.5939 - accuracy: 0.6809 - val_loss: 0.5731 - val_accuracy: 0.7125
Epoch 4/20
1125/1125 [==============================] - 53s 47ms/step - loss: 0.5586 - accuracy: 0.7166 - val_loss: 0.4907 - val_accuracy: 0.7715
Epoch 5/20
1125/1125 [==============================] - 54s 48ms/step - loss: 0.5969 - accuracy: 0.6830 - val_loss: 0.6505 - val_accuracy: 0.6180
Epoch 6/20
1125/1125 [==============================] - 52s 47ms/step - loss: 0.5798 - accuracy: 0.6982 - val_loss: 0.5427 - val_accuracy: 0.7337
Epoch 7/20
1125/1125 [==============================] - 52s 47ms/step - loss: 0.6021 - accuracy: 0.6785 - val_loss: 0.6116 - val_accuracy: 0.6183
plot_fit_history(history)
# evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
print('Accuracy: %f' % (accuracy*100))
model_results['RNN'] = accuracy
313/313 [==============================] - 6s 18ms/step - loss: 0.6173 - accuracy: 0.6121
Accuracy: 61.210001
!pip install transformers
!pip install sentence-transformers
Successfully installed huggingface-hub-0.2.1 pyyaml-6.0 sacremoses-0.0.46 tokenizers-0.10.3 transformers-4.14.1
Successfully installed sentence-transformers-2.1.0 sentencepiece-0.1.96
from transformers import pipeline
# from transformers import BertTokenizer
from transformers import BertTokenizerFast
classifier = pipeline('sentiment-analysis')
# tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased', do_lower_case=False)
No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
classifier(df['review'][0])
[{'label': 'NEGATIVE', 'score': 0.9896913170814514}]
tokenizer.decode(tokenizer.encode(df['review'][0], padding=True, truncation=True,max_length=500, add_special_tokens = True))
'[CLS] One reviewers mentioned watching 1 Oz episode hooked They right exactly happened The first thing struck Oz brutality unflinching scenes violence set right word GO Trust show faint hearted timid This show pulls punches regards drugs sex violence Its hardcore classic use word It called OZ nickname given Oswald Maximum Security State Penitentary It focuses mainly Emerald City experimental section prison cells glass fronts face inwards privacy high agenda Em City home many Aryans Muslims gangstas Latinos Christians Italians Irish scuffles death stares dodgy dealings shady agreements never far away I would say main appeal show due fact goes shows dare Forget pretty pictures painted mainstream audiences forget charm forget romance OZ mess around The first episode I ever saw struck nasty surreal I say I ready I watched I developed taste Oz got accustomed high levels graphic violence Not violence injustice crooked guards sold nickel inmates kill order get away well mannered middle class inmates turned prison bitches due lack street skills prison experience Watching Oz may become comfortable uncomfortable viewing thats get touch darker side [SEP]'
# The classifier's maximum sequence length is 512 tokens, so we truncate each review to at most 500 tokens
def cut_review(x):
return tokenizer.decode(tokenizer.encode(x, padding=True, truncation=True, max_length=500, add_special_tokens = True))
raw_review_cut = spacy_df.apply(lambda x: cut_review(x['prp_review']), axis=1)
# Since BERT is slow, we predict on the first 1000 reviews as an example
raw_review_prd = raw_review_cut[:1000].apply(lambda x: classifier(x)[0]['label']).map({'POSITIVE':1, 'NEGATIVE':0})
print('Accuracy with raw text: ', (df['positive'][:1000] == raw_review_prd).mean())
model_results['HuggingFace'] = (df['positive'][:1000] == raw_review_prd).mean()
Accuracy with raw text: 0.82
# summarizer = pipeline("summarization")
# # Summarization took a lot of time and did not significantly increase accuracy
# summarization_review = spacy_df['prp_review'][:100].apply(lambda x: summarizer(x, max_length=500, do_sample=False)[0]['summary_text'])
# summarization_review_cut = summarization_review.apply(lambda x: cut_review(x))
# summarization_review_prd = summarization_review_cut.apply(lambda x: classifier(x)[0]['label']).map({'POSITIVE':1, 'NEGATIVE':0})
# print('Accuracy with raw text: ', (df['positive'][:100] == summarization_review_prd).mean())
If run, the output would be
Accuracy with raw text: 0.82
pd.DataFrame(model_results, index=['Accuracy']).T
| Accuracy | |
|---|---|
| LR(count_vectorizer) | 0.85224 |
| LR(TF-IDF) | 0.85216 |
| LR(spacy) | 0.85628 |
| LSTM | 0.87390 |
| RNN | 0.61210 |
| HuggingFace | 0.82000 |
# add more stopwords
# We add some new stopwords including some verbs, adverbs
# and words about sentiment to better analyze topics of the movies
new_stp = set(stp)  # copy so the original stop-word set stays unchanged
to_add=["worst","bad","good", "better", "like", "just","really", "great", "best", "ever", "seen", "waste",
"wast", "time", "money", "see", "saw", "well", "worth", "recommond", "blah", "anyon", "year", "old",
"even", "though", "although", "charact", "main", "theater", "hour", "minute", "watch", "minut",
"never", "make", "take", "would", "highly", "video"]
for w in new_stp:
w_stem = stemmer.stem(w)
if w_stem!=w and w_stem not in new_stp:
to_add.append(w_stem)
for t in to_add:
new_stp.add(t)
# define function for regex cleaning
def replace_x(x, y, text):
    # substitute regex pattern x with y in every element of the Series, in place
    for i in range(0, len(text)):
        text[i] = re.sub(x, y, text[i], flags=re.IGNORECASE)
    return text
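A quick sanity check of the number-removal regex on a couple of made-up review snippets:

```python
import re

samples = ["saw it 3 times in 2021", "rated 10/10"]  # made-up snippets
cleaned = [re.sub(r"\d+", "", s, flags=re.IGNORECASE) for s in samples]
# digit runs are stripped; the surrounding whitespace and punctuation remain
```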
# remove numbers
df["cleaned_stem_reviews"]=replace_x(r"\d+", '', df["cleaned_stem_reviews"])
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:4: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy after removing the cwd from sys.path.
# We tried ngram=2 and ngram=3 and found ngram=3 more meaningful.
# Although infrequent words can be useful for analyzing individual texts,
# they are less useful in topic modeling because we want topics in common.
# min_df=10 removes those rare words and lowers the dimensionality.
# max_df=0.4 removes common words that appear in too many documents and topics.
from sklearn.decomposition import NMF
vectorizer = TfidfVectorizer(ngram_range=(3,3),
min_df = 10, max_df=0.4, stop_words=new_stp)
X_reviews, reviews_terms = vectorizer.fit_transform(df["cleaned_stem_reviews"]), vectorizer.get_feature_names()
reviews_tf_idf = pd.DataFrame(X_reviews.toarray(), columns=reviews_terms)
print(f"Reviews TF-IDF: {reviews_tf_idf.shape}")
reviews_tf_idf.head(5)
/usr/local/lib/python3.7/dist-packages/sklearn/utils/deprecation.py:87: FutureWarning: Function get_feature_names is deprecated; get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead. warnings.warn(msg, category=FutureWarning)
Reviews TF-IDF: (50000, 2011)
[Output: first 5 rows of the 50000 × 2011 trigram TF-IDF DataFrame. Each column is a stemmed three-token phrase (e.g. "world war ii", "base true stori", "sci fi channel"); the sampled entries are all 0.0 because the matrix is sparse. 5 rows × 2011 columns]
from sklearn.decomposition import NMF

# Factor the TF-IDF matrix into 5 topics: X ≈ W @ H
nmf = NMF(n_components=5)
W_reviews = nmf.fit_transform(X_reviews)  # document-topic weights (N x 5)
H_reviews = nmf.components_               # topic-term weights (5 x M)
print(f"Original shape of X reviews is {X_reviews.shape}")
print(f"Decomposed W reviews matrix is {W_reviews.shape}")
print(f"Decomposed H reviews matrix is {H_reviews.shape}")
Original shape of X reviews is (50000, 2011) Decomposed W reviews matrix is (50000, 5) Decomposed H reviews matrix is (5, 2011)
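To make the shapes above concrete, the sketch below runs the same factorization on a small synthetic matrix (not the review data). The explicit `init` and `random_state` values are illustrative choices that also silence the `init` deprecation warning and make the run reproducible.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative matrix standing in for a TF-IDF matrix:
# 100 "documents" x 20 "features".
rng = np.random.RandomState(0)
X = rng.rand(100, 20)

# NMF factors X (n_docs x n_features) into W (n_docs x k) and
# H (k x n_features) such that X ≈ W @ H, with all entries >= 0.
nmf = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)
H = nmf.components_

assert W.shape == (100, 5)
assert H.shape == (5, 20)
approx = W @ H  # low-rank reconstruction of X
```

Row `i` of `W` says how strongly document `i` loads on each topic, and row `k` of `H` says which features define topic `k`; the two helper functions below read off exactly these two views.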
from typing import List
import numpy as np
def get_top_tf_idf_tokens_for_topic(H: np.ndarray, feature_names: List[str], num_top_tokens: int = 5):
    """
    Uses the H matrix (K topics x M original features) to identify, for each
    topic, its highest-weighted tokens and their share of the topic's total weight.
    """
    for topic, vector in enumerate(H):
        print(f"TOPIC {topic}\n")
        total = vector.sum()
        top_scores = vector.argsort()[::-1][:num_top_tokens]  # indices of the largest weights
        token_names = [feature_names[idx] for idx in top_scores]
        strengths = [vector[idx] / total for idx in top_scores]
        for strength, token_name in zip(strengths, token_names):
            print(f"{token_name} ({round(strength * 100, 1)}%)\n")
        print("=" * 50)
print(f"Movie Reviews Topics:\n\n")
get_top_tf_idf_tokens_for_topic(H_reviews, reviews_tf_idf.columns.tolist(), 5)
Movie Reviews Topics:

TOPIC 0
new york citi (59.2%)
live new york (1.8%)
street new york (1.6%)
set new york (1.0%)
citi new york (1.0%)
==================================================
TOPIC 1
world war ii (64.0%)
post world war (1.9%)
save privat ryan (1.3%)
kristin scott thoma (1.3%)
sit back enjoy (0.8%)
==================================================
TOPIC 2
base true stori (69.4%)
stori base true (1.1%)
real life stori (0.7%)
top notch perform (0.6%)
dream come true (0.6%)
==================================================
TOPIC 3
sci fi channel (53.0%)
fi channel origin (3.2%)
john rhi davi (1.8%)
made sci fi (1.7%)
horror sci fi (1.7%)
==================================================
TOPIC 4
texa chainsaw massacr (22.4%)
low budget horror (12.7%)
blair witch project (7.5%)
night live dead (2.7%)
budget horror flick (1.8%)
==================================================
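The W matrix supports the classification use case from the introduction: each review can be tagged with the topic it loads on most heavily. A minimal sketch using a toy W (in the notebook, `W_reviews` would play this role):

```python
import numpy as np

# Toy document-topic matrix: 3 documents x 3 topics.
W = np.array([
    [0.9, 0.0, 0.1],   # mostly topic 0
    [0.1, 0.7, 0.2],   # mostly topic 1
    [0.0, 0.2, 0.8],   # mostly topic 2
])

# Dominant topic per document: argmax over each row.
dominant_topic = W.argmax(axis=1)
print(dominant_topic)  # [0 1 2]

# Normalizing each row gives a per-document topic distribution,
# i.e. "how much is this review about each topic".
topic_share = W / W.sum(axis=1, keepdims=True)
```

The per-document percentages printed by `get_top_documents_for_each_topic` below are exactly these row-normalized shares, scaled to 100.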
def get_top_documents_for_each_topic(W: np.ndarray, documents: List[str], num_docs: int = 5):
    """
    Uses the W matrix (N documents x K topics) to print, for each topic,
    the documents with the largest weight on that topic, along with the
    percentage of each document's total weight that the topic accounts for.
    """
    sorted_docs = W.argsort(axis=0)[::-1]  # per-topic document ranking, descending
    top_docs = sorted_docs[:num_docs].T    # shape (K topics, num_docs)
    per_document_totals = W.sum(axis=1)
    for topic, top_documents_for_topic in enumerate(top_docs):
        print(f"Topic {topic}")
        for doc in top_documents_for_topic:
            score = W[doc][topic]
            percent_about_topic = round(score / per_document_totals[doc] * 100, 1)
            print(f"{percent_about_topic}%", documents[doc])
        print("=" * 50)
get_top_documents_for_each_topic(W_reviews, df.review.tolist(), num_docs=5)
[Output truncated: for each of the five topics, the five reviews with the highest weight on that topic are printed, each scoring ~100% on its topic. Topic 0's top reviews all discuss New York City; Topic 1's discuss World War II; Topic 2's are films "based on a true story"; Topic 3's concern Sci-Fi Channel productions; Topic 4's cover low-budget horror such as The Texas Chainsaw Massacre.]
H_reviews_df = pd.DataFrame(H_reviews, columns=reviews_tf_idf.columns)
H_reviews_df
[Output: H_reviews as a 5 × 2011 DataFrame of topic-term weights (rows = topics, columns = trigram features). Most weights are near zero and each topic concentrates on a handful of phrases; e.g. row 1 puts weight 3.25 on "world war ii" while most other entries in that row are below 0.001. 5 rows × 2011 columns]
5 rows × 2011 columns
# Convert each row of the topic-term matrix H_reviews_df into a
# {term: weight} dictionary, one per topic.
occurence = []
for i in range(len(H_reviews_df)):
    occurence.append(H_reviews_df.loc[i, :].to_dict())
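As a side note, the row-by-row loop above can be replaced by a single pandas call, `DataFrame.to_dict('records')`, which returns one `{column: value}` dict per row. A minimal sketch, using `H_demo` as a hypothetical stand-in for `H_reviews_df`:

```python
import pandas as pd

# Hypothetical stand-in for H_reviews_df: a topic-by-term weight matrix
# (rows = topics, columns = vocabulary terms).
H_demo = pd.DataFrame(
    [[0.1, 0.0, 0.5], [0.0, 0.3, 0.2]],
    columns=["movie", "plot", "actor"],
)

# Row-by-row loop, as in the cell above:
occurence = [H_demo.loc[i, :].to_dict() for i in range(len(H_demo))]

# Equivalent one-liner: one dict per row.
occurence_fast = H_demo.to_dict("records")

assert occurence == occurence_fast
```

The one-liner avoids repeated `.loc` indexing, which matters if the matrix grows much wider than 2011 terms.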
# This snippet creating the word clouds is adapted from our HW of DSO528.
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
#________________________________________________________________________
def random_color_func(word=None, font_size=None, position=None,
                      orientation=None, font_path=None, random_state=None):
    # Build an HSL color whose hue comes from the global `tone`
    # (set per cluster in the loop below); the lightness is randomized
    # so that words within the same cloud remain distinguishable.
    h = int(360.0 * tone / 255.0)
    s = 100  # full saturation
    l = int(100.0 * float(random_state.randint(70, 120)) / 255.0)
    return "hsl({}, {}%, {}%)".format(h, s, l)
#________________________________________________________________________
def make_wordcloud(liste, increment):
    # Draw one word cloud per cluster into the shared figure (global `fig`).
    ax1 = fig.add_subplot(4, 2, increment)
    words = dict()
    trunc_occurences = liste[0:150]  # keep only the 150 highest-weighted terms
    for s in trunc_occurences:
        words[s[0]] = s[1]
    #________________________________________________________
    wordcloud = WordCloud(width=1000, height=400, background_color='lightgrey',
                          max_words=1628, relative_scaling=0,
                          color_func=random_color_func,
                          normalize_plurals=False)
    wordcloud.generate_from_frequencies(words)
    ax1.imshow(wordcloud, interpolation="bilinear")
    ax1.axis('off')
    plt.title('cluster nº{}'.format(increment - 1))
#________________________________________________________________________
fig = plt.figure(1, figsize=(100, 100))
color = [0, 160, 130, 95, 280, 40, 330, 110, 25]  # one hue per cluster
n_clusters = 5
for i in range(n_clusters):
    list_cluster_occurences = occurence[i]
    tone = color[i]  # define the color of the words for this cluster
    # Sort the cluster's terms by weight, highest first
    liste = []
    for key, value in list_cluster_occurences.items():
        liste.append([key, value])
    liste.sort(key=lambda x: x[1], reverse=True)
    make_wordcloud(liste, i + 1)
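For a quick textual check alongside the word clouds, the top-weighted terms per topic can also be printed directly from the topic-term matrix. A minimal sketch, using `H_demo` as a hypothetical stand-in for `H_reviews_df` and a helper `top_terms` that is not part of the notebook above:

```python
import pandas as pd

# Hypothetical stand-in for H_reviews_df (rows = topics, columns = terms).
H_demo = pd.DataFrame(
    [[0.9, 0.1, 0.0, 0.4], [0.0, 0.7, 0.6, 0.1]],
    columns=["movie", "music", "song", "plot"],
)

def top_terms(h_row, n=2):
    """Return the n highest-weighted terms for one topic row."""
    return h_row.sort_values(ascending=False).head(n).index.tolist()

for topic_idx in range(len(H_demo)):
    print("Topic {}: {}".format(topic_idx, top_terms(H_demo.loc[topic_idx, :])))
```

Printing the term lists is cheaper than rendering word clouds and is easier to paste into a report, so it can serve as a sanity check before generating the figures.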
The word clouds above visualize the most heavily weighted terms in each of the 5 topics extracted from the review dataset.